Towards minimum perceptual error training for DNN-based speech synthesis

نویسندگان

  • Cassia Valentini-Botinhao
  • Zhizheng Wu
  • Simon King
چکیده

We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the SpectroTemporal Excitation Pattern that was originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients and compare generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domains. Objective results indicate that the perceptual domain system achieves the highest quality.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Preliminary Work on Speaker Adaptation for Dnn-based Speech Synthesis

We investigate speaker adaptation in the context of deep neural network (DNN) based speech synthesis. More specifically, our current work focuses on the exploitation of auxiliary information such as gender, speaker identity or age during the DNN training process. The proposed technique is compared to standard acoustic feature transformations such as the feature based maximum likelihood linear r...

متن کامل

Dynamic noise aware training for speech enhancement based on deep neural networks

We propose three algorithms to address the mismatch problem in deep neural network (DNN) based speech enhancement. First, we investigate noise aware training by incorporating noise information in the test utterance with an ideal binary mask based dynamic noise estimation approach to improve DNN’s speech separation ability from the noisy signal. Next, a set of more than 100 noise types is adopte...

متن کامل

Deep neural network-based statistical parametric speech synthesis system using improved time-frequency trajectory excitation model

This paper proposes a deep neural network (DNN)-based statistical parametric speech synthesis system using an improved time-frequency trajectory excitation (ITFTE) model. The ITFTE model, which efficiently reduces the parametric redundancy of a TFTE model, improved the perceptual quality of the vocoding process and the estimation accuracy of the training process. However, there remain problems ...

متن کامل

Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis

Feed-forward deep neural networks (DNNs) based text-tospeech (TTS) synthesis, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform a decision tree based, GMM-HMM counterpart. However, the DNN-based TTS training has not taken the whole sequence, i.e., sentence, int...

متن کامل

A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation

In contrast to the conventional minimum mean squared error (MMSE) training criterion for nonlinear spectral mapping based on deep neural networks (DNNs), we propose a probabilistic learning framework to estimate the DNN parameters for singlechannel speech separation. A statistical analysis of the prediction error vector at the DNN output reveals that it follows a unimodal density for each log p...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015